ABSTRACT

Preserving user privacy is paramount when it comes to publicly
disclosed datasets that contain fine-grained data about large
populations. The problem is especially critical in the case of
mobile traffic datasets collected by cellular operators, as they
feature high subscriber trajectory uniqueness and they are
resistant to anonymization through spatiotemporal
generalization. In this work, we investigate the
k-anonymizability of trajectories in two large-scale mobile
traffic datasets, by means of a novel dedicated measure. Our
results are in agreement with those of previous analyses;
however, they also provide additional insights into the reasons
behind the poor anonymizability of mobile traffic datasets.
As such, our study is a step forward in the direction of
more robust dataset anonymization.

1. INTRODUCTION 

Public disclosure of datasets containing micro-data, i.e.,
information on precise individuals, is an increasingly
frequent practice. Such datasets are collected in a number of
different ways, including surveys, transaction recorders,
positioning data loggers, mobile applications, and
communication network probes. They yield fine-grained data
about large populations that has proven critical to seminal
studies in a number of research fields.

However, preserving user privacy in publicly accessible
micro-data datasets is currently an open problem. Publishing
an incorrectly anonymized dataset may disclose sensitive
information about specific users. This has been repeatedly
proven in the past. One of the first and best known attempts
at re-identification of badly anonymized datasets was carried
out by then MIT graduate student Latanya Sweeney in 1996. By
using a database of medical records released by an insurance
company and the voter roll for the city of Cambridge (MA),
purchased for 20 US dollars, Dr. Sweeney could successfully
re-identify the full medical history of the then governor of
Massachusetts, William Weld. She even sent the governor’s full
health records, including diagnoses and prescriptions, to his
office. A later, yet equally famous experiment was performed
by Narayanan et al. [3] on a dataset released by Netflix for a
data-mining contest, which was cross-correlated with a web
scraping of the popular IMDB website. The authors were able to
match two users from both datasets, revealing, e.g., their
political views.

Mobile traffic datasets include micro-data collected at
different locations of the cellular network infrastructure,
concerning the movements and traffic generated by thousands to
millions of subscribers, typically for long timespans in the
order of weeks or months. They have become a paramount
instrument in large-scale analyses across disciplines such as
sociology, demography, epidemiology, or computer science.
Unfortunately, mobile traffic datasets may also be prone to
attacks on individual privacy. Specifically, they suffer from
the following two issues.

1. High uniqueness. Mobile subscribers have very distinctive
patterns that often make them unique even within a very large
population. Zang and Bolot [4] showed that 50% of the mobile
subscribers in a 25 million-strong dataset could be uniquely
detected with minimal knowledge about their movement patterns,
namely the three locations they visit the most frequently. The
result was corroborated by de Montjoye et al. [5], who
demonstrated how an individual can be pinpointed among 1.5
million other mobile customers with a probability almost equal
to one, by just knowing five random spatiotemporal points
contained in his mobile traffic data.

Uniqueness does not imply identifiability, since the sole
knowledge of a unique subscriber trajectory cannot disclose
the subscriber’s identity. Building that correspondence
requires instead sensitive side information and cross-database
analyses similar to those carried out on medical or Netflix
records. To date, there has been no actual demonstration of
subscriber re-identification from mobile traffic datasets
using such techniques, and our study does not change that
situation. Still, uniqueness may be a first step towards
re-identification, and whether this represents a threat to
user privacy is an open topic for discussion.

In such a context, the standard, safe approach to ensure data
confidentiality relies on non-technical solutions, i.e.,
non-disclosure agreements that clearly define the scope of the
activities (e.g., fundamental research only) carried out on
the datasets, and that prevent open disclosure of the data or
results without prior verification by the relevant
authorities. This is, for instance, the solution adopted in
the case of the mobile traffic information we will consider in
Sec. 3.

Clearly, this practice can strongly limit the availability
of mobile traffic datasets, as well as the reproducibility
of related research. Mitigating the uniqueness of subscriber
trajectories then becomes a very desirable capability that can
yield more privacy-preserving datasets and favor their open
circulation. It is, however, at this point that the second
problem of mobile traffic datasets comes into play.

2. Low anonymizability. The legacy solution to reduce
uniqueness in micro-data datasets is generalization and
suppression. However, previous studies showed that blurring
users in the crowd, by reducing the spatial and temporal
granularity of the data, is hardly a solution in the case of
mobile traffic datasets. Zang and Bolot [4] found that
reliable anonymization is attained only under very coarse
spatial aggregation, namely when the mobile subscriber
location granularity is reduced to the city level. Similarly,
de Montjoye et al. [5] proved that a power-law relationship
exists between uniqueness and spatiotemporal aggregation of
mobile traffic. This implies that privacy is increasingly hard
to ensure as the resolution of a dataset is reduced. In
conclusion, not only do mobile traffic datasets yield highly
unique trajectories, but the latter are also hard to
anonymize. Ensuring individual privacy risks lowering the
level of detail of such datasets to the point that they are
no longer informative.

In this work, we aim at better investigating the reasons
behind such inconvenient properties of mobile traffic
datasets. We focus on anonymizability, since it is a more
revealing feature: multiple datasets that feature similar
trajectory uniqueness may be more or less difficult to
anonymize. Attaining our objective brings along the following
contributions: (i) we define a measure of the level of
anonymizability of mobile traffic datasets, in Sec. 2; (ii) we
provide a first assessment of the anonymizability of two
large-scale mobile traffic datasets, in Sec. 3; (iii) we
unveil the cause of high uniqueness and poor anonymizability
in such datasets, i.e., the heavy tail of the temporal
diversity among subscriber mobility patterns, in Sec. 4.
Finally, Sec. 5 concludes the paper.


Table 1: Standard micro-data database format.

Pseudo-id   Gender   Age   ZIP     Degree        Income
00013701    Male     21    77005   Bachelor      13,000
08936402    Male     37    77065   Master’s      90,000
42330327    Female   60    89123   High School   46,000



Table 2: Mobile traffic database format.

Pseudo-id   Spatiotemporal samples (fingerprint)
a           c1,8    c2,14   c3,17
b           c4,8    c5,15   c6,15   ...   c13,15   c14,16   c15,17
c           c16,7   c17,20



2. HOW ANONYMIZABLE IS YOUR 
MOBILE TRAFFIC FINGERPRINT? 

In this section, we first define in a formal way the
problem of user uniqueness in mobile traffic datasets,
in Sec. 2.1. Then, we introduce the proposed measure
of anonymizability, in Sec. 2.2.

2.1 Our problem 

In order to properly define the problem we target, we
need to introduce the notion of mobile traffic fingerprint
that is at the base of the mobile traffic dataset format.
We also need to specify the type of anonymity we consider
in our case, k-anonymity. Next, we discuss these
aspects of the problem.

2.1.1 Mobile traffic fingerprint and dataset 

Traditional micro-data databases are structured into
matrices where each row maps to one individual, and
each column to an attribute. An example is provided
in Tab. 1. Individuals are associated with one identifier,
i.e., a value that uniquely pinpoints the user across
datasets (e.g., his complete name, social security number, or
passport number). Since identifiers allow direct
identification and immediate cross-database correlation, they
are never disclosed. Instead, they are replaced by a
pseudo-identifier, which is again unique for each individual,
but changes across datasets (e.g., a random string
substituting the actual identifier). Then, standard
re-identification attacks leverage quasi-identifiers, i.e., a
sequence of known attributes of one user (e.g., the age,
gender, ZIP code, etc.) to recognize the user in the
dataset. If successful, the attacker then has access to
the complete record of the target user. This knowledge
can directly include sensitive attributes, i.e., items that
should not be disclosed because they may pertain to the
personal sphere of the individual (e.g., diseases, political
or religious views, sexual orientation, etc.). It can
also be exploited for further cross-database correlation
so as to extract additional private information about
the user.

The same model directly applies to the case of mobile
traffic datasets. However, the database semantics
make all the difference here: while mobile users are the
obvious individuals whose privacy we want to protect,
attributes are now sequences of spatiotemporal samples.
Each sample is the result of an event that the cellular
network associated with the user. An illustration is provided
in Fig. 1a, which portrays the trajectories of three
mobile customers, denoted with pseudo-identifiers a, b,
and c, respectively, across an urban area. User a interacts
with the radio access infrastructure at 8 am, while
he is in cell c1 along his trajectory. Then, he triggers
additional mobile traffic activities at 2 pm, while located
in a cell c2 in the city center, and at 5 pm, from
a cell c3 in the South-East city outskirts. The same
goes for users b and c. All these spatiotemporal samples
are recorded by the mobile operator¹ and constitute
the mobile traffic fingerprint of the user. The resulting
database has a format such as that in Tab. 2, where
subscriber identifiers are replaced by pseudo-identifiers,
and each element of a user’s fingerprint is a cell and
hourly timestamp pair.

2.1.2 k-anonymity in mobile traffic 

In order to preserve user privacy in micro-data, one
has to ensure that no individual can be uniquely pinpointed
in a dataset. This principle has led to the definition of
multiple notions of non-uniqueness, such as k-anonymity [7],
l-diversity [8] and t-closeness [9]. Among those, k-anonymity
is the baseline criterion, to which l-diversity or t-closeness
add further security layers that cope with sensitive
attributes or cross-database correlation. More precisely,
k-anonymity ensures that, for each individual, the set of
attributes (or its quasi-identifier subset) is identical to
that of at least k-1 other users. In other words, each
individual is always hidden in a crowd of k, and thus he
cannot be uniquely identified among such other users.

Granting k-anonymity in micro-data databases implies
generalizing and suppressing data. As an example, in order to
ensure 2-anonymity on the age and ZIP code attributes for the
first user in Tab. 1, one can aggregate the age in twenty-year
ranges, and the ZIP codes in three-digit prefixes: both the
first and second user end up with a (20,40) age and 770** ZIP
code, which makes them both 2-anonymous. Clearly, the process
is lossy, since the information granularity is reduced. Many
efficient algorithms have been proposed that achieve
k-anonymity in legacy micro-data databases, while minimizing
information loss [10].

¹ The actual precision of the information recorded, both in
space and in time, can depend significantly on the nature of
the probes used by the operator. Typically, probes located
closer to the radio access can capture more events at a finer
granularity, but require more extensive deployments to attain
a coverage similar to that of lower-precision probes located
in the mobile network core. In all cases, our discussion is
independent of the mobile traffic data collection technique,
and all the analyses performed in this work can be applied
to any type of mobile traffic data.
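The generalization of Tab. 1 described above can be sketched in a few lines. The snippet is only an illustration under stated assumptions: the helper names (generalize, crowd_sizes) are hypothetical, only the age and ZIP code quasi-identifiers are considered, and the bucketing rule (twenty-year ranges aligned to multiples of twenty, three-digit ZIP prefixes) is one possible reading of the example.

```python
from collections import Counter

# Quasi-identifiers (pseudo-id, age, ZIP code) of the three rows of Tab. 1.
RECORDS = [
    ("00013701", 21, "77005"),
    ("08936402", 37, "77065"),
    ("42330327", 60, "89123"),
]


def generalize(age, zip_code, age_span=20, zip_digits=3):
    """Coarsen one (age, ZIP) pair; the bucketing rule is an assumption."""
    low = (age // age_span) * age_span
    masked = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (low, low + age_span), masked


def crowd_sizes(records):
    """Map each pseudo-id to the number of records sharing its generalized tuple."""
    keys = {pid: generalize(age, zip_code) for pid, age, zip_code in records}
    counts = Counter(keys.values())
    return {pid: counts[key] for pid, key in keys.items()}


if __name__ == "__main__":
    for pid, size in crowd_sizes(RECORDS).items():
        print(pid, "is hidden in a crowd of", size)
```

Under these assumptions the first two users share the same generalized tuple, whereas the third one remains unique and would require further generalization or suppression.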

Also in mobile traffic datasets, k-anonymity is regarded
as a best practice, and data aggregation is the common
approach to achieve it. In this case, one has to ensure that
the fingerprint of each subscriber is identical to that of at
least k-1 other mobile users in the same dataset. We remark
that previous works have typically considered a model of
attacker who only has partial knowledge of the subscribers’
fingerprints, e.g., most popular locations [4] or random
samples [5]. In order to counter such an attack model, a
partial k-anonymization, targeting the limited information
owned by the attacker, would be sufficient. However, we are
interested in a general solution, so we do not make any
assumption on the precise knowledge of the attacker, which can
be diverse and possibly broad. Thus, k-anonymizing the whole
fingerprint of each subscriber in the mobile traffic dataset
is the only way to deterministically ensure mobile user
privacy.

Both spatial and temporal aggregations can be leveraged
to attain this goal. Examples are provided in
Fig. 1b and Fig. 1c. In Fig. 1b, cells are aggregated in
large sets that roughly map to the nine major neighborhoods
of the urban area; also, time is aggregated
in two-hour intervals. The reduction of spatiotemporal
granularity allows 2-anonymizing mobile users a and
b: both have now a fingerprint composed of samples
(V,8-9), (III,14-15), and (VII,16-17). User c has
instead a different fingerprint, with samples (IV,6-7)
and (III,20-21). If we need to 3-anonymize all three
mobile customers in the example, then a further
generalization is required, as in Fig. 1c. There, the
metropolitan region is divided into West and East halves, and
only two time intervals, before and after noon, are
considered. The result is that all subscribers a, b, and c have
identical fingerprints (West,1-12) and (East,13-24).
Clearly, this level of anonymization comes at a high cost
in terms of information loss, as the location data is very
coarse both in space and time.
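The two aggregation levels of the example can be reproduced with a short sketch. Everything named here is an assumption made for illustration: the cell-to-neighborhood and neighborhood-to-half maps are hypothetical and merely chosen to agree with the narrative, user b is shortened to three samples with hypothetical hours, and fingerprints are compared as sets of generalized samples.

```python
from collections import Counter

# Hypothetical maps, chosen only to be consistent with the example in the text.
NEIGHBORHOOD = {"c1": "V", "c4": "V", "c2": "III", "c5": "III",
                "c3": "VII", "c6": "VII", "c16": "IV", "c17": "III"}
HALF = {"V": "West", "IV": "West", "III": "East", "VII": "East"}

# (cell, hour) samples; user b is a simplified, hypothetical fingerprint.
FINGERPRINTS = {
    "a": [("c1", 8), ("c2", 14), ("c3", 17)],
    "b": [("c4", 8), ("c5", 15), ("c6", 16)],
    "c": [("c16", 7), ("c17", 20)],
}


def two_hour_slot(hour):
    """Two-hour interval containing `hour`, e.g. 15 -> (14, 15)."""
    start = hour - hour % 2
    return (start, start + 1)


def half_day_slot(hour):
    """Before-noon (1-12) or after-noon (13-24) interval."""
    return (1, 12) if hour <= 12 else (13, 24)


def generalize(fingerprint, space, time):
    """Apply a spatial and a temporal generalization to every sample."""
    return frozenset((space(cell), time(hour)) for cell, hour in fingerprint)


def crowd_sizes(fingerprints, space, time):
    """Number of users sharing each user's generalized fingerprint."""
    coarse = {u: generalize(fp, space, time) for u, fp in fingerprints.items()}
    counts = Counter(coarse.values())
    return {u: counts[fp] for u, fp in coarse.items()}


first_level = crowd_sizes(FINGERPRINTS, lambda c: NEIGHBORHOOD[c], two_hour_slot)
second_level = crowd_sizes(FINGERPRINTS, lambda c: HALF[NEIGHBORHOOD[c]], half_day_slot)
```

With these inputs, the first level hides a and b in a crowd of two while c stays unique, and the second level makes all three fingerprints identical, at the price of a much coarser description of where and when each user was active.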

This is precisely the problem of low anonymizability
of mobile traffic datasets unveiled by previous works [4, 5]:
even guaranteeing 2-anonymization in a very large
population requires severe reductions of the spatiotemporal
granularity, which limits the usability of the data.

2.2 A measure of anonymizability 

We intend to devise a measure of anonymizability
that is based on the k-anonymity criterion. Thus, our
proposed measure evaluates the effort, in terms of data
aggregation, needed to make a user indistinguishable
from k-1 other subscribers.

Figure 1: Example of mobile traffic fingerprints of three
subscribers. (a) Initial dataset granularity: user locations
are represented at cell level, and the temporal information
has an hourly precision. (b) First aggregation level (panel
title: Aggregated): positions are recorded at each
neighborhood, and the time granularity is reduced to two
hours. (c) Second aggregation level (panel title: More
aggregated): location data is limited to the Eastern or
Western half of the city, and the time information is merged
over 12 hours.

We start by defining the distance between two spatiotemporal
samples in the mobile traffic fingerprints of two mobile
users. Each sample is composed of spatial information (e.g.,
the cell location) and temporal information (e.g., the
timestamp). The distance must take into account both
dimensions. A generic formulation of the distance between the
i-th sample of a’s fingerprint, $(s_i^a, t_i^a)$, and the
j-th sample of b’s fingerprint, $(s_j^b, t_j^b)$, is

$$d_{ab}(i,j) = w_s \, \delta_s(s_i^a, s_j^b) + w_t \, \delta_t(t_i^a, t_j^b). \qquad (1)$$

Here, $\delta_s$ and $\delta_t$ are functions that determine
the distance along the spatial and temporal dimensions,
respectively. The former thus operates on the spatial
information in the two samples, $s_i^a$ and $s_j^b$, and the
latter on the temporal information, $t_i^a$ and $t_j^b$. The
factors $w_s$ and $w_t$ weigh the spatial and temporal
contributions in (1). In the following, we will assume that
the two dimensions have the same importance, thus
$w_s = w_t = 1/2$.

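We shape the $\delta_s$ and $\delta_t$ functions by considering
that both spatial and temporal aggregations induce a loss of
information that is linear with the decrease of granularity.
However, above a given spatial or temporal threshold, the
information loss is so severe that the data is not usable
anymore. As a result, the functions can be expressed as

$$\delta_s(s_i^a, s_j^b) = \begin{cases} \dfrac{\mathrm{dist}(s_i^a, s_j^b)}{\delta_s^{max}} & \text{if } \mathrm{dist}(s_i^a, s_j^b) < \delta_s^{max} \\[2ex] 1 & \text{otherwise,} \end{cases} \qquad (2)$$

and

$$\delta_t(t_i^a, t_j^b) = \begin{cases} \dfrac{|t_i^a - t_j^b|}{\delta_t^{max}} & \text{if } |t_i^a - t_j^b| < \delta_t^{max} \\[2ex] 1 & \text{otherwise.} \end{cases} \qquad (3)$$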

In (2), $\mathrm{dist}(s_i^a, s_j^b) = |s_i^a.x - s_j^b.x| + |s_i^a.y - s_j^b.y|$
is the Taxicab distance [11] between the spatial components
of the samples, whose coordinates are denoted as
x and y in a valid map projection system. Both functions
fulfill the properties of distances, i.e., are positive
definite, symmetric, and satisfy the triangle inequality.
They range from 0 (samples are identical from a spatial
or temporal viewpoint) to 1 (samples are at or beyond
the maximum meaningful aggregation threshold). Concerning
the values of the thresholds, in the following we
will consider that the aggregation limits beyond which
the information deprivation is excessive are
$\delta_s^{max}$ = 20 km for the spatial dimension (i.e., the
size of a city, beyond which all intra-urban movements are
lost) and $\delta_t^{max}$ = 8 hours for the temporal dimension
(beyond which the night, working hours, and evening
periods are merged together).
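As a minimal sketch of the sample distance defined in (1)-(3), the snippet below assumes that a sample is a tuple (x_km, y_km, t_min) of projected coordinates in kilometers and a timestamp in minutes; this representation and the function names are illustrative assumptions, while the thresholds and weights are those stated in the text.

```python
DELTA_S_MAX_KM = 20.0      # spatial threshold: 20 km
DELTA_T_MAX_MIN = 8 * 60   # temporal threshold: 8 hours, in minutes
W_S = W_T = 0.5            # equal weights for the two dimensions


def delta_s(xa, ya, xb, yb):
    """Normalized Taxicab distance, saturating at 1 beyond the threshold."""
    dist = abs(xa - xb) + abs(ya - yb)
    return dist / DELTA_S_MAX_KM if dist < DELTA_S_MAX_KM else 1.0


def delta_t(ta, tb):
    """Normalized time difference, saturating at 1 beyond the threshold."""
    gap = abs(ta - tb)
    return gap / DELTA_T_MAX_MIN if gap < DELTA_T_MAX_MIN else 1.0


def sample_distance(sample_a, sample_b):
    """Weighted sum of the spatial and temporal distances between two samples."""
    xa, ya, ta = sample_a
    xb, yb, tb = sample_b
    return W_S * delta_s(xa, ya, xb, yb) + W_T * delta_t(ta, tb)
```

For instance, two samples that are 4 km (Taxicab) and 2 hours apart have normalized distances 0.2 and 0.25, so their sample distance is 0.225 under equal weights.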

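The sample distance in (1) can be used to define the
distance between the whole fingerprints of two mobile
subscribers a and b, as

$$\Delta_{ab} = \begin{cases} \dfrac{1}{n_a} \displaystyle\sum_{h=1}^{n_a} \min_{k=1,\dots,n_b} d_{ab}(h,k) & \text{if } n_a > n_b \\[3ex] \dfrac{1}{n_b} \displaystyle\sum_{h=1}^{n_b} \min_{k=1,\dots,n_a} d_{ba}(h,k) & \text{otherwise.} \end{cases} \qquad (4)$$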

Here, $n_a$ and $n_b$ are the cardinalities of the fingerprints
of a and b, respectively. The expression in (4) takes the
longer fingerprint between the two, and finds, for each
sample, the sample at minimum distance in the shorter
fingerprint. The resulting $\Delta_{ab}$ is the average among all
such sample distances, and $\Delta_{ab} = \Delta_{ba}$, $\forall a, b$.
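The fingerprint distance can be sketched as follows. To keep the example independent of any particular sample representation, the sample distance is passed in as a callable d, which is an assumption of this sketch; the toy distance used in the usage lines is likewise hypothetical and only serves to exercise the function.

```python
def fingerprint_distance(fp_a, fp_b, d):
    """Average, over the longer fingerprint, of the minimum sample distance
    to the shorter fingerprint. `d` is any symmetric sample distance."""
    if not fp_a or not fp_b:
        raise ValueError("fingerprints must contain at least one sample")
    # Iterate over a's samples only when a is strictly longer, otherwise over b's.
    longer, shorter = (fp_a, fp_b) if len(fp_a) > len(fp_b) else (fp_b, fp_a)
    return sum(min(d(x, y) for y in shorter) for x in longer) / len(longer)


# Hypothetical one-dimensional samples and a toy distance capped at 1.
toy_d = lambda x, y: min(abs(x - y) / 10.0, 1.0)
example = fingerprint_distance([0, 5, 10], [0, 10], toy_d)
```

In the toy case only the middle sample of the longer fingerprint is unmatched, at distance 0.5, so the average over three samples is 1/6.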

The measure of anonymizability of a generic mobile
user a can be mapped, under the k-anonymity criterion,
to the average distance of his fingerprint from those of
the nearest k-1 other users. Formally,

$$\Delta_a^k = \frac{1}{k-1} \sum_{b \in \mathbb{N}_a^{k-1}} \Delta_{ab}, \qquad (5)$$

where $\mathbb{N}_a^{k-1}$ is the set of k-1 users b with the smallest
fingerprint distances to that of a.

The expression in (5) returns a measure $\Delta_a^k \in [0,1]$
that indicates how hard it is to hide subscriber a in a
crowd of k users. If $\Delta_a^k = 0$, then the user is already
k-anonymized in the dataset. If $\Delta_a^k = 1$, the user
is completely isolated, i.e., no sample in the fingerprints
of all other subscribers is within the spatial and temporal
thresholds, $\delta_s^{max}$ and $\delta_t^{max}$, from any sample of a’s
fingerprint.
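Given pairwise fingerprint distances, the anonymizability measure reduces to averaging the k-1 smallest distances from a user to all others. The sketch below assumes, for illustration only, that those distances are available as a precomputed square matrix (a list of lists indexed by user position); the matrix values are hypothetical.

```python
def anonymizability(distances, a, k):
    """Average fingerprint distance between user `a` and its k-1 nearest users.

    `distances` is a square matrix where distances[a][b] is the fingerprint
    distance between users a and b (assumed precomputed).
    """
    n = len(distances)
    if k < 2 or k > n:
        raise ValueError("k must satisfy 2 <= k <= number of users")
    others = sorted(distances[a][b] for b in range(n) if b != a)
    nearest = others[:k - 1]
    return sum(nearest) / (k - 1)


# Hypothetical distances among four users; the last one is fully isolated.
D = [
    [0.0, 0.1, 0.3, 1.0],
    [0.1, 0.0, 0.2, 1.0],
    [0.3, 0.2, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
```

A brute-force evaluation of this kind is quadratic in the number of users, so a real dataset would likely need indexing or sampling; that aspect is outside the scope of this sketch.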

3. TWO MOBILE TRAFFIC USE CASES 

We employ the proposed measure to assess the level 
of anonymizability of fingerprints present in two mobile 
traffic datasets released by Orange in the framework of 
the Data for Development Challenge. In order to allow 
for a fair comparison, we preprocessed the datasets so 
as to make them more homogeneous. 

• Ivory Coast. Released for the 2012 Challenge,
this dataset describes five months of Call Detail
Records (CDR) over the whole African nation
of Ivory Coast. We used the high spatial
resolution dataset, containing the complete
spatiotemporal trajectories for a subset of 50,000 randomly
selected users that are changed every two
weeks. Thus, the dataset contains information
about ten 2-week periods overall. We performed
a preliminary screening, discarding the sparsest
trajectories and keeping the users that have at
least one spatiotemporal point per day. Then, we
merged all the users that met this criterion in a single
dataset, so as to achieve a meaningful size of
around 82,000 users. This dataset is indicated as
d4d-civ in the following.

• Senegal. The 2014 Challenge dataset is derived
from CDR collected over the whole of Senegal for one
year. We used the fine-grained mobility dataset,
containing a randomly selected subset of around
300,000 users over a rolling 2-week period, for a total
of 25 periods. We did not filter out subscribers,
since the dataset is already limited to users that
are active for more than 75% of the 2-week time
span. In our study, we consider one representative
2-week period among those available. This dataset
is referred to as d4d-sen in the following.



Figure 2: CDF of the anonymizability measure, under 
the 2-anonymity criterion, in the d4d-civ and d4d-sen 
mobile traffic datasets. 

In both mobile traffic datasets, the information about
the user position² is provided as a latitude and longitude
pair. We projected the latter in a two-dimensional
coordinate system using the Lambert azimuthal equal-area
projection. We then discretized the resulting positions
on a 100-m regular grid, which represents the maximum
spatial granularity we consider³. As far as the
temporal dimension is concerned, the maximum precision
granted by both datasets is one minute, and this is
also our finest time granularity.
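The preprocessing above can be approximated with the standard library alone. The sketch below is not the authors’ pipeline: it assumes a spherical Earth of radius 6371 km, a projection center supplied by the caller (the center used in the study is not stated), and hypothetical function names; it applies the textbook spherical Lambert azimuthal equal-area formulas and then snaps positions to a 100-m grid and timestamps to one-minute slots.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # spherical approximation (assumption)


def laea_project(lat_deg, lon_deg, lat0_deg, lon0_deg):
    """Spherical Lambert azimuthal equal-area projection, in meters,
    centered on (lat0_deg, lon0_deg). Undefined at the antipode of the center."""
    phi, lam = math.radians(lat_deg), math.radians(lon_deg)
    phi0, lam0 = math.radians(lat0_deg), math.radians(lon0_deg)
    dlam = lam - lam0
    denom = (1.0 + math.sin(phi0) * math.sin(phi)
             + math.cos(phi0) * math.cos(phi) * math.cos(dlam))
    scale = math.sqrt(2.0 / denom)
    x = EARTH_RADIUS_M * scale * math.cos(phi) * math.sin(dlam)
    y = EARTH_RADIUS_M * scale * (math.cos(phi0) * math.sin(phi)
                                  - math.sin(phi0) * math.cos(phi) * math.cos(dlam))
    return x, y


def discretize(x_m, y_m, t_s, cell_m=100.0, slot_s=60):
    """Grid cell indices (100-m cells) and time slot index (one-minute slots)."""
    return (math.floor(x_m / cell_m), math.floor(y_m / cell_m), int(t_s // slot_s))
```

A production pipeline would more likely rely on a geodesy library with an ellipsoidal model; the spherical version is used here only to keep the example self-contained.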

4. RESULTS 

The measure of anonymizability in (5) can be intended
as a dissimilarity measure, and employed in legacy
definitions used to understand micro-data database
sparsity, e.g., (ε,δ)-sparsity [3]. However, these
definitions are less informative than the complete distribution
of the anonymizability measure. Thus, in this section,
we employ Cumulative Distribution Functions (CDF)
of the measure in (5) in order to assess the
anonymizability of the two datasets presented before.
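Reading a CDF of the measure amounts to two elementary queries: the fraction of users whose measure does not exceed a given value, and the value below which a given fraction of users falls. A minimal sketch follows; the function names and the four measure values are hypothetical and do not come from the datasets.

```python
import math
from bisect import bisect_right


def ecdf_at(values, x):
    """Fraction of values that are less than or equal to x."""
    ordered = sorted(values)
    return bisect_right(ordered, x) / len(ordered)


def quantile(values, p):
    """Smallest observed value v such that ecdf_at(values, v) >= p, for 0 < p <= 1."""
    ordered = sorted(values)
    index = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[index]


# Hypothetical per-user anonymizability measures.
measures = [0.05, 0.09, 0.17, 0.40]
half_point = quantile(measures, 0.5)
```

In this toy input, half of the users have a measure of 0.09 or less, which is the kind of statement made about the real distributions in the remainder of the section.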

4.1 The good: anonymity is close to reach 

Our basic result is shown in Fig. 2. The plot portrays
the CDF of the anonymizability measure computed on
all users in the two reference mobile traffic datasets,
d4d-civ and d4d-sen, when considering 2-anonymity
as the privacy criterion.

We observe that the two curves are quite similar, and
both are at zero in the x-axis origin. This means that no
single mobile subscriber is 2-anonymous in either of the
original datasets. Since similar observations were made
on different data [4, 5], our results seem to confirm that
the high uniqueness of subscriber trajectories is an
intrinsic property of any mobile traffic dataset, and not
just a specificity of those we analyse in this study.

² The spatial information maps to the antenna location in
d4d-civ, and to a random point within the Voronoi cell
associated with the antenna in d4d-sen.

³ At 100-m spatial granularity, each square cell contains
at most one antenna or Voronoi location from the original
dataset. In other words, this discretization does not imply
any spatial aggregation.

Figure 3: CDF of the anonymizability measure, for
varying k of the k-anonymity criterion, in the (a) d4d-civ
and (b) d4d-sen mobile traffic datasets.

More interestingly, the probability mass is gathered, in
both cases, in the 0.1-0.2 range, i.e., it is quite close to
the origin. This is good news, since it implies that the
average aggregation effort needed to achieve 2-anonymity
is not high. As an example, 50% of the users in the
d4d-civ dataset have a measure of 0.09 or less, which
maps, on average, to a combined spatiotemporal
aggregation of less than one km and little more than 20
minutes. In other words, the result seems to suggest
that half of the individuals in the dataset can be
2-anonymized if the spatial granularity is decreased to
1 km, and the temporal precision is reduced to around
20 minutes. Similar considerations hold in the d4d-sen
case, where, e.g., 80% of the dataset population has a
measure of 0.17 or less. Such a measure is the result of
average spatial and temporal distances of 1.7 km and
41 minutes from 2-anonymity.

One may wonder how more stringent privacy requirements
affect these results. Fig. 3 shows the evolution of
the anonymizability of the two datasets when k varies
from 2 to 100. As expected, higher values of k require
that a user is hidden in a larger crowd, and thus shift
the distributions towards the right, implying the need
for a coarser aggregation. However, quite surprisingly,
the shift is not dramatic: 100-anonymity does not
appear much more difficult to reach than 2-anonymity.

4.2 The bad: aggregation does not work 

Unfortunately, the easy anonymizability suggested by
the distributions is only apparent. Fig. 4 depicts the
impact of spatiotemporal generalization on anonymizability:
each curve maps to a different level of aggregation,
from 100 meters and 1 minute (the finest granularity) to
20 km and 8 hours. As one could expect, the curves
are pushed towards smaller values of the anonymizability
measure. However, the reduction of spatiotemporal
precision does not have the desired magnitude, and even
a coarse-grained citywide, 8-hour aggregation can
2-anonymize only 30% of the mobile users.

Figure 4: CDF of the anonymizability measure, under
the 2-anonymity criterion and for varying spatiotemporal
aggregation levels, in the d4d-civ and d4d-sen
mobile traffic datasets. The legend reports the level of
spatial (in kilometers) and temporal (in minutes)
aggregation each curve refers to.

This result is again in agreement with previous studies
[4, 5], and confirms that mobile traffic datasets are
difficult to anonymize.

4.3 The why: long-tailed temporal diversity 

We are interested in understanding the reasons behind
the incongruity above, i.e., the fact that spatiotemporal
aggregation yields such poor performance, even if
the average effort needed to attain k-anonymity is in
theory not high.

To attain our goal, we proceed along two directions.
First, we separate the spatial and temporal dimensions
of the measure in (5), so as to understand their precise
contribution to the dataset anonymizability. Second, we
measure the statistical dispersion of the fingerprint
distances along the two dimensions: the rationale is that
we observed the average distance among fingerprints to
be quite small, thus the reason for the low anonymizability
must lie in the deviation of sample distances around
that mean.

4.3.1 Impact of space and time dimensions 

Formally, we consider, for each user a in the dataset,
the set $\mathbb{N}_a^{k-1}$ of k-1 other subscribers whose
fingerprints are the closest to that of a, according to (5).
Then, we disaggregate all the fingerprint distances
$\Delta_{ab}$ between a and the users $b \in \mathbb{N}_a^{k-1}$ into
sample distances $d_{ab}$, as per (4). Finally, we separately
collect the spatial and temporal components of all such
sample distances, in (1), into ordered sets
$\mathbb{S}_a^k = \{w_s \delta_s\}$ and $\mathbb{T}_a^k = \{w_t \delta_t\}$.
The resulting sets can be treated as disjoint distributions
of the distances, along the spatial and temporal
dimensions, between the fingerprint of a generic individual
a and those of the k-1 other users that show the
most similar patterns to his.

Examples of the spatial and temporal distance distributions
we obtain in the case of 2-anonymity are shown in
Fig. 5a-5e. Each plot refers to one random
user in the d4d-civ or d4d-sen dataset, and portrays
the CDF of the spatial ($w_s \delta_s$) and temporal ($w_t \delta_t$)
component distance, as well as that of the total sample
distance ($d$). We can remark that temporal components
typically bring a significantly larger contribution to the
total fingerprint distance than spatial ones. In fact, a
significant portion of the spatial components is at zero
distance, i.e., is immediately 2-anonymous in the original
dataset. The same is not true for the temporal
components.

Figure 5: (a)-(e) CDF of the sample distance, and of its spatial and temporal components, under the 2-anonymity
criterion, for five random mobile users in the d4d-civ and d4d-sen mobile traffic datasets. (f) Contribution of the
temporal components to the total sample distance, expressed as the ratio between the sums of temporal component
distances and spatial component distances. Panels: (a) d4d-civ, id 370; (b) d4d-civ, id 224; (c) d4d-civ, id 3175;
(d) d4d-sen, id 658; (e) d4d-sen, id 2130; (f) Time weight.

Figure 6: (a,c) CDF of the Gini coefficient computed on the sample distance distributions of all users in the d4d-civ
and d4d-sen datasets, for the 2-anonymity criterion. (b,d) CDF of the Tail weight index computed on the sample
distance distributions of all users in the d4d-civ and d4d-sen datasets, for the 2-anonymity criterion.

A rigorous confirmation is provided in Fig. 5f, which
shows the distribution of the temporal-to-spatial component
ratios, i.e., $\sum_{\mathbb{T}_a^k} w_t \delta_t / \sum_{\mathbb{S}_a^k} w_s \delta_s$,
for all subscribers a in the two reference datasets. The CDF
is skewed towards high values, and for half of the mobile
subscribers in both the d4d-civ and d4d-sen datasets temporal
components contribute 80% or more of the total sample
distance. We conclude that the temporal component of
a mobile traffic fingerprint is much harder to anonymize
than the spatial one. In other words, where an individual
generates mobile traffic activity is easily masked,
but hiding when he carries out such activity is not.
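The relation between the temporal-to-spatial ratio and the temporal share of the total distance can be made explicit with a small sketch. It assumes that the weighted spatial and temporal components of one user’s sample distances have already been collected into two lists; the function name and the numbers are hypothetical.

```python
def temporal_contribution(spatial_components, temporal_components):
    """Return (temporal-to-spatial ratio, temporal share of the total distance)."""
    s = sum(spatial_components)
    t = sum(temporal_components)
    ratio = t / s if s > 0 else float("inf")
    share = t / (s + t) if (s + t) > 0 else 0.0
    return ratio, share


# Hypothetical weighted components for one user.
ratio, share = temporal_contribution([0.0, 0.01, 0.02], [0.04, 0.05, 0.03])
```

A ratio of 4 between the two sums corresponds to temporal components accounting for 80% of the total sample distance, since share = ratio / (1 + ratio).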

4.3.2 Dispersion of fingerprint sample distances 

Not only do temporal components weigh much more
than spatial ones in the fingerprint distance, but they
also seem to show longer tails in Fig. 5a-5e. Longer
tails imply the presence of more samples with a large
distance: this, in turn, significantly increases the level
of aggregation needed to achieve k-anonymity, as the
latter is only granted once all samples in the fingerprint
have zero distance from those in the second fingerprint.

We rigorously evaluate the presence of a long tail of
hard-to-anonymize samples by means of two complementary
metrics, still separating their spatial and temporal
components. The first metric is the Gini coefficient,
which measures the dispersion of a distribution
around its mean. Considering an ordered set $\mathbb{S} = \{s_i\}$,
the coefficient is computed as

$$G(\mathbb{S}) = 1 - \frac{2 \sum_{i=1}^{N} (N+1-i)\, s_i - \sum_{i=1}^{N} s_i}{N \sum_{i=1}^{N} s_i}, \qquad (6)$$

where N is the cardinality of $\mathbb{S}$. We compute the Gini
coefficient on the sets $\mathbb{S}_a^k$ and $\mathbb{T}_a^k$, for all users a.
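A Gini coefficient for a set of non-negative distances can be computed as sketched below. The code uses the standard rank-weighted sample formula on values sorted in ascending order; treating it as equivalent to the expression in (6) is an assumption of this sketch, as is the convention of returning 0 for an all-zero input.

```python
def gini(values):
    """Sample Gini coefficient of non-negative values (0 = no dispersion)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("at least one value is required")
    total = sum(data)
    if total == 0:
        return 0.0  # convention: identical (all-zero) values have no dispersion
    # Rank-weighted sum over the ascending ordering, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(data, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

Identical values yield 0, whereas concentrating the whole mass in a single element out of N yields (N-1)/N, e.g., 0.75 for four values.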

The second metric is the Tail weight index [12], which
quantifies the weight of the tail of a distribution with
empirical CDF $F$ as

$$T(F) = \frac{F^{-1}(0.99) - F^{-1}(0.5)}{F^{-1}(0.75) - F^{-1}(0.5)} \cdot \frac{\Phi^{-1}(0.75) - \Phi^{-1}(0.5)}{\Phi^{-1}(0.99) - \Phi^{-1}(0.5)}. \qquad (7)$$

In the expression above, $F^{-1}(\cdot)$ is the inverse function
of the empirical CDF and $\Phi^{-1}(\cdot)$ is the inverse function
of a standard normal CDF. We again compute the Tail
weight index on the distributions obtained from both
$\mathbb{S}_a^k$ and $\mathbb{T}_a^k$, for all $a$.
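
A minimal sketch of the Tail weight index for one set of sample distances is given below. It compares the spread between the 99th percentile and the median with the spread between the 75th percentile and the median, and normalizes by the same quantity for a standard normal distribution. The helper names and the linear-interpolation quantile are assumptions; other quantile conventions would give slightly different values on small samples.

```python
import math
from statistics import NormalDist


def empirical_quantile(sorted_vals, p):
    """Inverse empirical CDF at p, interpolating linearly between order
    statistics (an assumed convention; others exist)."""
    pos = p * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def tail_weight_index(samples):
    """Tail weight of the empirical distribution relative to a normal one."""
    s = sorted(samples)
    if not s:
        raise ValueError("need at least one sample")
    q50, q75, q99 = (empirical_quantile(s, p) for p in (0.5, 0.75, 0.99))
    if q75 == q50:
        raise ValueError("undefined: 75th percentile equals the median")
    phi_inv = NormalDist().inv_cdf   # inverse CDF of the standard normal
    normal_ratio = (phi_inv(0.99) - phi_inv(0.5)) / (phi_inv(0.75) - phi_inv(0.5))
    return ((q99 - q50) / (q75 - q50)) / normal_ratio
```

With this normalization, samples from a normal distribution give an index close to 1, and samples from an exponential distribution give roughly 1.6.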

Fig. 6 shows the results returned by the two metrics
in the d4d-civ and d4d-sen datasets. No significant
differences emerge between the two mobile traffic datasets.
In both cases, the Gini coefficient, in Fig. 6c and Fig. 6a,
has, for all mobile user fingerprints ($d$), high values
around 0.5 that denote significant dispersion around the
mean. However, two opposite behaviors are observed
for the spatial ($w_s \delta_s$) and temporal ($w_t \delta_t$) components.
The former shows cases where no dispersion at all is
recorded (coefficient close to zero), and cases where the
distribution is very sparse. The latter has the same
behavior as the overall distance, with values clustered
around 0.5. The result (i) corroborates the observation
that the overall anonymizability is driven by distances
along the temporal dimension, and (ii) attributes the
latter to the complete absence of easy-to-anonymize short
tails in the distribution of temporal distances.

Fig. 6d and Fig. 6b show instead the CDF of Tail
weight indices. Here, the result is even clearer: the
tail of temporal component distances is typically much
longer than that of spatial ones, and lies in between those
of exponential and heavy-tailed distributions⁴. Once
more, the temporal component tail fundamentally shapes
that of the overall fingerprint distance.

⁴ As a reference, an exponential distribution with mean equal
to 1 has a Tail weight index of 1.6, and a Pareto distribution
with shape 1 has a Tail weight index of 14.

5. DISCUSSION AND CONCLUSIONS

In light of all previous observations, we confirm
the findings of previous works on user privacy preservation
in mobile traffic datasets. Namely, the two datasets
we analysed do not grant $k$-anonymity, not even for
the minimum $k = 2$. Moreover, our reference datasets
show poor anonymizability, i.e., they require substantial
spatial and temporal generalization in order to even
slightly improve user privacy. The fact that these
properties have been independently verified across diverse
datasets of mobile traffic suggests that the high uniqueness
of trajectories and low anonymizability are intrinsic
properties of this type of dataset.

In our case, even a citywide, 8-hour aggregation is
not sufficient to ensure complete 2-anonymity for all
subscribers. The result is even worse than that observed in
previous studies: the difference is due to the fact that
we consider the anonymization of complete subscriber
fingerprints, whereas past works focus on simpler
obfuscation of summaries [4] or subsets [5] of the fingerprints.

Our analysis also unveiled the reasons behind the
poor anonymizability of the mobile traffic datasets we
consider, as follows.

On the one hand, the typical mobile user fingerprint
in such datasets is composed of many spatiotemporal
samples that are easily hidden among those of other
users in the dataset. This leads to fingerprints that
appear easily anonymizable, since their samples can be
matched, on average, with minimal spatial and temporal
aggregation.

On the other hand, mobile traffic fingerprints tend
to have a non-negligible number of elements that are
much more difficult to anonymize than the average sample.
These elements, which determine a characteristic
dispersion and long-tail behavior in the distribution of
fingerprint sample distances, are mainly due to a significant
diversity along the temporal dimension. In other
words, mobile users may have similar spatial fingerprints,
but their temporal patterns typically contain a
non-negligible number of dissimilar points.

It is the presence of these hard-to-anonymize elements
in the fingerprint that makes spatiotemporal aggregation
scarcely effective in attaining anonymity. Indeed,
in order to anonymize a user, one needs to aggregate
over space and time until all of that user's long-tail
samples are hidden within the fingerprints of other
subscribers. As a result, even significant reductions of
granularity (and consequent information losses) may not be
sufficient to ensure non-uniqueness in mobile traffic datasets.

As a concluding remark, we recall that such uniqueness
does not imply direct identifiability of mobile
users, which is much harder to achieve and requires, in
any case, cross-correlation with non-anonymized datasets.
Instead, uniqueness is a first step towards re-identification.
Understanding its nature can help develop mobile
traffic datasets that are even more privacy-preserving,
and thus more easily accessible.